feat: improve search highlighting system #39

Merged
kdroidFilter merged 6 commits into master from fix/hallucination-blacklist
Jan 20, 2026
Conversation

kdroidFilter (Owner) commented Jan 20, 2026

Summary

  • Add hallucination blacklist to filter incorrect LLM-generated dictionary mappings
  • Filter hallucinations only for highlighting (not search) to maintain good recall
  • Exclude 2-letter terms from dictionary expansion highlighting
  • Improve snippet positioning to show text where most query terms cluster together
  • Add buildHighlightTerms() API for intelligent find-in-page mode

Technical Details

  • Blacklist loaded from hallucination_blacklist.tsv resource file for easy modification
  • Snippet algorithm optimized: max 5 occurrences per term, early exit on perfect match
  • New public method buildHighlightTerms(query) exposes dictionary expansion for reuse

Test plan

  • Search for "לחתוך צנון בסכין בשרי" - results should be relevant
  • Search for "כי ביצחק יקרא לך זרע" - snippet should show the actual verse
  • Verify highlighting doesn't include unrelated 2-letter words
  • Test buildHighlightTerms returns expanded terms with hallucination filtering

…sions

Add a blacklist mechanism to filter out LLM-generated dictionary mappings
that incorrectly link unrelated Hebrew words. The blacklist is loaded from
a TSV resource file for easy modification.

This is a temporary workaround until the lexical dictionary itself can be
corrected.
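A minimal sketch of how such a blacklist could be read from the TSV resource. The two-column layout (query term, then the wrongly linked word), the `#` comment convention, the JVM classloader lookup, and the names `parseBlacklist` and `loadBlacklist` are assumptions for illustration; the PR only states that the file is `hallucination_blacklist.tsv`.

```kotlin
// Hypothetical sketch: assumes each data line is "term<TAB>wrongExpansion".
// Blank lines, lines starting with '#', and lines with fewer than two
// columns are skipped.
fun parseBlacklist(lines: Sequence<String>): Set<Pair<String, String>> =
    lines
        .map { it.trim() }
        .filter { it.isNotEmpty() && !it.startsWith("#") }
        .mapNotNull { line ->
            val cols = line.split('\t')
            if (cols.size >= 2) cols[0].trim() to cols[1].trim() else null
        }
        .toSet()

// Assumes a JVM target where the TSV sits on the classpath.
// Returns an empty set when the resource is missing, so search still works.
fun loadBlacklist(resourceName: String = "hallucination_blacklist.tsv"): Set<Pair<String, String>> {
    val stream = Thread.currentThread().contextClassLoader.getResourceAsStream(resourceName)
        ?: return emptySet()
    return stream.bufferedReader(Charsets.UTF_8).useLines { parseBlacklist(it) }
}
```

Keeping the parser separate from the resource lookup lets the blacklist format be tested without touching the classpath.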

Keep full dictionary expansions for search to maintain good recall,
but filter out hallucinated mappings only when building highlight terms
to avoid highlighting unrelated words.

2-letter words should only be highlighted if they were explicitly
written in the query, not when they come from dictionary expansion.
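The split between search and highlighting described in these commits could look roughly like the sketch below. The `Map`-based dictionary, the pair-based blacklist, and the function names `expandForSearch` and `expandForHighlight` are hypothetical; the actual data structures in the repository are not shown in this PR.

```kotlin
// Search side: keep every dictionary expansion to preserve recall.
fun expandForSearch(queryTerms: List<String>, dictionary: Map<String, Set<String>>): Set<String> =
    queryTerms.flatMap { listOf(it) + dictionary[it].orEmpty() }.toSet()

// Highlight side: terms typed in the query are always kept; expansions are
// dropped when blacklisted or when they are 2 letters long.
// Note: length counts UTF-16 units, so this assumes unpointed text.
fun expandForHighlight(
    queryTerms: List<String>,
    dictionary: Map<String, Set<String>>,
    blacklist: Set<Pair<String, String>>,
): Set<String> {
    val result = LinkedHashSet<String>(queryTerms)
    for (term in queryTerms) {
        for (expansion in dictionary[term].orEmpty()) {
            if ((term to expansion) in blacklist) continue // hallucinated mapping
            if (expansion.length == 2) continue            // 2-letter expansion
            result += expansion
        }
    }
    return result
}
```

In this sketch a 2-letter word still gets highlighted when the user typed it, because query terms are added before any filtering happens.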

Instead of centering the snippet around the first occurrence of any
anchor term, find the position where the most query terms cluster
together. This ensures the snippet shows the most relevant part of
the text when multiple query terms appear scattered throughout.

- Inline findBestAnchorPosition into buildSnippetInternal
- Limit to 5 occurrences per term to bound search space
- Early exit when all terms found clustered together
- Use simpler data structures (Pair instead of data class)
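One way to picture the clustering idea with the stated bounds is the sketch below. It is not the code from `buildSnippetInternal`: the function name `findBestAnchor`, the character-based `window` parameter, and the exact scoring are assumptions. It collects at most 5 occurrences per term as `Pair(position, termIndex)`, then picks the occurrence whose following window covers the most distinct terms, stopping early once every term that occurs in the text is covered.

```kotlin
// Hypothetical sketch of cluster-based anchor selection.
// Returns the start offset of the best cluster, or -1 if no term occurs.
fun findBestAnchor(text: String, terms: List<String>, window: Int = 200, maxPerTerm: Int = 5): Int {
    // (position, termIndex) pairs, capped at maxPerTerm hits per term.
    val hits = mutableListOf<Pair<Int, Int>>()
    terms.forEachIndexed { termIndex, term ->
        if (term.isEmpty()) return@forEachIndexed
        var pos = text.indexOf(term)
        var count = 0
        while (pos >= 0 && count < maxPerTerm) {
            hits += pos to termIndex
            count++
            pos = text.indexOf(term, pos + term.length)
        }
    }
    if (hits.isEmpty()) return -1
    hits.sortBy { it.first }

    val termsPresent = hits.map { it.second }.toSet().size
    var bestPos = hits.first().first
    var bestCount = 0
    for ((i, hit) in hits.withIndex()) {
        val start = hit.first
        val covered = HashSet<Int>()
        var j = i
        // Count distinct terms that start within `window` characters of this hit.
        while (j < hits.size && hits[j].first - start <= window) {
            covered += hits[j].second
            j++
        }
        if (covered.size > bestCount) {
            bestCount = covered.size
            bestPos = start
            if (bestCount == termsPresent) break // early exit: nothing can score higher
        }
    }
    return bestPos
}
```

With at most 5 hits per term, the nested scan stays small even for long texts, which is the point of bounding the search space.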

Add public method to build highlight terms with dictionary expansion,
filtered for hallucinations and 2-letter terms. This enables find-in-page
to highlight the same words as global search.
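As an illustration of how find-in-page might consume this method, here is a sketch under stated assumptions. The PR names `buildHighlightTerms(query)` but does not show its return type, so the `HighlightTermSource` interface with a `List<String>` result is hypothetical, as is the `findHighlightRanges` helper.

```kotlin
// Hypothetical stand-in for whatever class exposes buildHighlightTerms(query);
// the List<String> return type is an assumption.
fun interface HighlightTermSource {
    fun buildHighlightTerms(query: String): List<String>
}

// Finds every non-overlapping occurrence of each highlight term in the page
// text and returns the character ranges sorted by start offset.
fun findHighlightRanges(pageText: String, query: String, source: HighlightTermSource): List<IntRange> {
    val ranges = mutableListOf<IntRange>()
    for (term in source.buildHighlightTerms(query).filter { it.isNotEmpty() }) {
        var from = pageText.indexOf(term)
        while (from >= 0) {
            ranges += from until from + term.length
            from = pageText.indexOf(term, from + term.length)
        }
    }
    return ranges.sortedBy { it.first }
}
```

Because the terms come from the same expansion and filtering path as global search, the page view and the result snippets would highlight the same words.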
kdroidFilter changed the title from "fix: improve search highlighting accuracy" to "feat: improve search highlighting system" on Jan 20, 2026
kdroidFilter merged commit 19f0e6d into master on Jan 20, 2026
1 check passed